Machine Learning Models to Predict In-Hospital Mortality among Inpatients with COVID-19: Underestimation and Overestimation Bias Analysis in Subgroup Populations

Prediction of the death among COVID-19 patients can help healthcare providers manage the patients better. We aimed to develop machine learning models to predict in-hospital death among these patients. We developed different models using different feature sets and datasets developed using the data balancing method. We used demographic and clinical data from a multicenter COVID-19 registry. We extracted 10,657 records for confirmed patients with PCR or CT scans, who were hospitalized at least for 24 hours at the end of March 2021. The death rate was 16.06%. Generally, models with 60 and 40 features performed better. Among the 240 models, the C5 models with 60 and 40 features performed well. The C5 model with 60 features outperformed the rest based on all evaluation metrics; however, in external validation, C5 with 32 features performed better. This model had high accuracy (91.18%), F-score (0.916), Area under the Curve (0.96), sensitivity (94.2%), and specificity (88%). The model suggested in this study uses simple and available data and can be applied to predict death among COVID-19 patients. Furthermore, we concluded that machine learning models may perform differently in different subpopulations in terms of gender and age groups.


Introduction
In spite of more than 2 years since the COVID-19 pandemic and performing vaccination in many countries, the disease's prevalence and mortality have not slowed down, and many countries are still experiencing high peaks [1]. In addition, multiple mutations in the virus have become a new challenge to control the disease, leading to the spread of the disease and increased mortality [2][3][4]. Until April 16, 2022, more than 500 million cases of the disease and more than 6 million deaths due to COVID-19 have been reported globally, with more than 7 million cases and 140,000 deaths in Iran [1].
Since the beginning of the COVID-19 pandemic, one of the most critical challenges for the healthcare systems has been to increase the number of patients with severe symptoms and the growing demand for hospitalization. In developing countries, which do not have sufficient healthcare infrastructure, the increase in inpatients has put a lot of burden on the healthcare system. Moreover, numerous studies have reported various risk factors such as old age, male gender, and underlying medical conditions (such as hypertension, cardiovascular disease, diabetes, COPD, cancer, and obesity) for the deterioration of COVID-19 patients [5][6][7][8][9]. e use of modern and noninvasive methods to triage patients into specific and known categories at the early stages of the disease is beneficial [10]. One of these approaches is the use of predictive models based on machine learning [11,12]. For example, developing predictive models based on mortality risk factors can positively prevent mortality through controlling acute conditions and planning in intensive care units [13]. Furthermore, machine learning can classify patients based on the deteriorating risk and predict the likelihood of death to manage resources optimally [14,15].
To date, several studies have been published on the application of machine learning to develop diagnostic models or predict the death of patients due to COVID-19 [14][15][16][17][18][19][20][21][22][23]. For example, several deep learning models have been reported to diagnose COVID-19 based on images [24]. In a study, researchers developed an enhanced fuzzy-based deep learning model to differentiate between COVID-19 and infectious pneumonia (no-COVID-19) based on portable CXRs and achieved up to 81% accuracy. eir fuzzy model had only three misclassifications on the validation dataset [24].
As for death prediction, several studies have also been published [16,[25][26][27][28]. e results obtained from the studies on machine learning-based predictive methods indicated that those methods had reliable predictability and could identify the correlation between intervening variables in complex and ambiguous conditions caused by  erefore, they can be used to predict such situations in the future. Although those techniques have been tested on some regional datasets of the risk factors, the performance of the models can be improved when they apply to different datasets related to other countries such as Iran, where the prevalence of the COVID-19 and related deaths is high.
Iran is one of the first countries to face a widespread outbreak of the disease and has experienced more than four major epidemic waves with the highest mortality rates [29,30]. As a result, due to the high prevalence and mortality rate of COVID-19 in Iran and the limitation of healthcare resources [31,32], it is vital to have a prediction model based on Iranian conditions and local data. erefore, this study aimed to fit a model for predicting the death caused by COVID-19 based on machine learning algorithms. Many previous models are based on laboratory, imaging, or treatment data [16,[25][26][27][28]; however, we suggested models based on available demographic data, symptoms, and comorbidities that can be easily collected. We also conducted a bias analysis of machine learning models based on subgroups of patient populations to show the bias of these models.

Population and Data.
We extracted data from the Khuzestan COVID-19 registry system belonging to Ahvaz Jundishapure University of Medical Sciences (AJUMS). From the beginning of the pandemic, this registry collects data from suspected (based on clinical signs) and confirmed (based on the results of PCR or CT scan) outpatients and inpatients in Khuzestan province, Iran. is registry collects demographic data, signs and symptoms, patient outcomes, PCR and CT results, and comorbidities from 38 hospitals. e details of data collection and data quality control were published elsewhere [30].
We included only patients with a confirmed diagnosis of COVID-19 based on PCR test or CT scan results for this modeling study. Furthermore, we included only patients who were hospitalized for more than 24 hours. Because outpatients and hospitalized patients with a short stay (less than 24 hours) had a lot of missing data, we excluded these cases from the final analysis. We also included patients from all age groups. Finally, we extracted data for 10,657 patients. e frequency of nonsurviving patients (until discharge) was 1711 (16.06%); 8946 patients (83.94%) were discharged alive. Figure 1 shows the steps of this study.

Imputing Missing Variables.
Because of the data quality controls in the registry, the database had a low rate of missing data. e 28 variables had a missing rate below 4% (Supplement 1, Table S1). In machine learning, data imputation is a standard approach to improve the models' performance. Different methods such as imputation with mean, median, or mode are common. We imputed the missing values with the mean for age and the highest frequency of values for nonnumerical variables as well [11,33].

Features and Feature Selection.
e outcome measure of the study is in-hospital mortality until discharge which is collected as binary (yes/no). e dataset contains 60 input variables. Age and the number of comorbidities are numerical; oxygen saturation level (PO2) includes two values including below and above 93%. We created three dummy variables for the diagnosis method (only positive PCR, only abnormal CT, positive PCR, and abnormal CT). Other variables have two values: yes or no.
For feature selection, we applied univariate analysis using Chi-square or Fisher exact tests for nonnumerical variables and Mann-Whitney U test for age and number of comorbidities (due to abnormal distribution). We created different feature sets to build the prediction models. e first set included all the 60 variables. e second set consisted of variables that were significant in univariate analysis (P value <0.05). e third feature set included the marginal variables based on univariate analysis (P value <0.2). To create the fourth feature set, we used the feature selection node in the IBM SPSS modeler. is node identifies important features based on univariate analysis as well as the frequency of missing values and the percentage of records with the same value. Table 1 shows the variables in each of these feature sets.

Data Balancing.
We first developed our models with a variety of machine learning algorithms on the original dataset (dataset 1). We found the inappropriate performance of these models, in terms of the sensitivity, because of the small number of samples in the death class (83.94% surviving vs. 16.06% nonsurviving, ratio � 5.23), so the models did not perform well to predict death.
ere are various methods such as oversampling the minor class or undersampling the major class to solve this problem [11,12]. We oversampled the death cases to create more balanced datasets. Datasets 2 and 3 included 5,133 (36.5%, ratio � 1.74) and 8,938 (49.98%, ratio � 1) nonsurviving patients, respectively. We developed our models with all four feature sets on these three datasets.
We first developed models based on the default settings of parameters. We developed CHAID decision trees with a maximum depth of five and a minimum record of two in the nodes. Moreover, we implemented the C5 tree with a minimum of two records in nodes. RF was also implemented with a maximum depth of 10, and a minimum of five records in nodes using 100 models. e SVM model was  Journal of Healthcare Engineering implemented with a regularization parameter of 10 and a gamma of 0.1. We additionally developed MLPs using the different number of neurons (5, 10, 15, and 20) in one and two hidden layers and also with the number of neurons suggested by the software. We also implemented the best CHAID, C5, and MLP with boosting ensemble method and 10-fold cross-validation. Furthermore, we implemented stack models (combining individual models) [40]. Our analysis showed that models developed on dataset 3 had generally better performance. erefore, we developed stack models, based on the best individual models, on this dataset with different feature sets.

External Validation.
For external validation, we extracted 1734 records from the Khuzestan COVID-19 registry system. ese data are from four different hospitals in different timeframes. erefore, these data were not used in training or testing the models. is dataset contained 1425 surviving and 309 nonsurviving patients. Inclusion and exclusion criteria were similar to the training/testing dataset, described in Section 2.1. e best performing models selected from the previous step and also ensemble models were validated using this dataset.

Subpopulation Bias Analysis.
Previous studies show that predictive models may have different performances against different subpopulations, for example, in different sex or age groups [41,42]. To assess this effect, we adopted the method suggested by Seyyed-Kalantari et al. ey suggested the use of false-positive rate (FPR) and false-negative rate (FNR) in subpopulations to assess the underdiagnosis and overdiagnosis of machine learning models [41]. We similarly calculated FNR and FPR to assess the underprediction or overprediction of death in our models. To this end, we used the best performing models in external evaluation and the external dataset.

Analysis.
We applied IBM SPSS statistical software version 23 for statistical analysis and IBM SPSS modeler version 18 to develop and evaluate machine learning models. We evaluated and compared the models using confusion matrix, accuracy, precision, sensitivity, specificity, F-score, and Area under the Curve (AUC). To select the best performing models, we compared the models obtained from each dataset-feature with each other based on AUC and F-score.

Ethical Considerations.
is study received ethical approvals from the Ethics Research Committee of Ahvaz Jundishapur University of Medical Sciences (IR.AJUMS. REC.1400.325).

Descriptive Data.
We extracted data for 10,657 patients from the Khuzestan COVID-19 registry [30]. e frequency of nonsurviving patients (until discharge) was 1711 (16.06%); 8946 patients (83.94%) were discharged alive. Table 2 shows that the death due to COVID-19 was significantly higher among men, older patients, and those who have been in contact with infected individuals. In addition, respiratory distress, convulsion, altered consciousness, and paralysis were more common among the nonsurviving patients. Conversely, cough, headache, diarrhea, and dizziness were less prevalent among them. Furthermore, oxygen saturation status was better among the recovered patients versus the dead. Moreover, the comorbidities and risk factors (excluding pregnancy) as well as the intubation,  Journal of Healthcare Engineering oxygen therapy at the beginning of hospitalization, and ICU admission were significantly higher among the dead.

e Machine Learning Algorithms and eir Evaluation.
e results of performing various models with different settings on three datasets and four feature groups are reported as follows.

e Machine Learning Algorithms on Original Dataset 1.
e details on the performance of the models are given in Supplement 1 (Tables S2-S5). e result showed that the lowest and highest accuracy of the models based on the original dataset 1 were 84.52% (RF with 32 features) and 91.12% (Bayesian network with 32 features), respectively. In addition, the minimum and maximum AUC were 0.757 (C5 with 32 features) and 0.914 (Bayesian network with 32 features), respectively. According to the findings, the sensitivity for predicting death based on original dataset 1 was low and between 0.484 (MLP network with 60 features) and 0.775 (RF with 32 features) which indicates that the sensitivity of the models on imbalanced data is not appropriate. Table 3 shows the results of the performance of the top 10 models based on the test data of dataset 1. According to the table, the best two models were the Bayesian network and the CHAID tree on 32 features, respectively. e ROC curve for the best models is presented in Supplementary Figure S1.

3.2.2.
e Machine Learning Algorithms on Dataset 2. e details on the performance of the models based on dataset 2 are given in Supplement 1, Tables S6-S9. e findings showed that the lowest and highest accuracy were 82.64% (MLP with 60 features) and 87.86% (RF with 60 features), respectively. Moreover, the minimum and maximum values of the AUC were 0.888 (MLP with 60 features) and 0.942 (SVM with 60 features), respectively. According to the findings, the sensitivity for predicting death was between 0.658 (MLP network) and 0.861 (CHAID tree with 32 features). e best results obtained for each algorithm based on dataset 2 were shown in Supplementary Figure S2. According to Table 4, SVM and C5 models had the best performance on 60 and 40 features, respectively.

3.2.3.
e Machine Learning Algorithms on Dataset 3. e details on the performance of the models based on dataset 3 are given in Supplement 1, Tables S10-S13. e results showed that the lowest and highest accuracy were 81.27% (CHIAD tree with 32 features) and 92.77% (C5 with 60 features), respectively. Moreover, the minimum and   Figure S3. According to Table 5, the C5 model had the best performance with different features, and SVM with 60 features was also one of the optimal models. Table 6 indicates that the best ensemble model had 89.13% accuracy and 0.961 AUC. However, the comparison of these models with the corresponding individual models (Table 5) shows that C5 models have better performance than these ensemble models, even though these ensemble models are better than other individual models.

External Validation.
We evaluated all ensemble models (Table 6) and the top 10 models developed on dataset 3 (Table 5) using an external dataset. As shown in Table 7, C5 boosting models with feature sets 1 and 2 have better scores.

Subpopulation Bias Analysis.
We selected the four best models based on external validation for subpopulation bias analysis (Supplement 1, Table S14). Figures 2 and 3 show the FPR and FNR of these models. As these figures indicate, most of these models better perform on female patients than male patients. Furthermore, the performance of these models decreases in older patients. As for FPR, Figure 2 indicates that SVM and C5 (feature set 2) have a less biased prediction in terms of gender and age groups. Additionally, Figure 3 shows that C5 (feature set 2) has a less biased prediction.

Comparison of the Models.
A comparison of the models showed that, with the balancing of the data, the sensitivity and AUC increased. However, the accuracy based on dataset 2 decreased, but it also increased based on dataset 3. Furthermore, models with 60 and 40 features performed better. In general, the C5 model with 60 features outperformed the

3.7.
Variable Importance. Figure 4 shows the importance of each variable in the selected model (C5). As indicated, intubation, number of comorbidities, age, gender, respiratory distress, blood oxygen saturation level, ICU admission, cough, unconsciousness, positive PCR, and abnormal CT are considered the most important death predictors by this model.

Discussion
In the first stage of the study, the risk factors for death due to COVID-19 were discovered using univariate analysis. en, based on the important features, different machine learning models were developed to predict death. e results showed significant differences between recovered and nonrecovered  We found that intubation, number of comorbidities, age, gender, respiratory distress, blood oxygen saturation level, ICU admission, cough, unconsciousness, positive PCR, and abnormal CTare the most important death predictors. Other studies showed that age [17,18,23,27,28,43], male gender [43], respiratory disease [16,17], the number of comorbidities [43], and low oxygen saturation [17,18,23,43] increased cases of death due to COVID-19. Some researchers indicate that high blood pressure, heart disease, cancer, kidney disease [16,17], diabetes [18], cerebrovascular diseases [28], smoking [18,23], and asthma [16] increased mortality from COVID-19. However, our model did not consider these factors significant. It is worth mentioning that these risk factors increased the number of comorbidities in a patient and this factor was also considered significant in the C5 model.
We developed various models with different features to predict death from COVID-19. Based on the results, the best  performance was related to the C5 decision tree with 32 features. In the same way, several studies tried to develop machine learning models for predicting death from COVID-19 [16-23, 25-28, 43-45]. Since a variety of variables (demographic, laboratory, radiographic, therapeutic, signs and symptoms, and comorbidities) and datasets are used, it is not easy to compare the studies. For example, some researchers used laboratory data to develop models in addition to other variables [17,23,28,43], and a study applied only laboratory variables [45]. In another study, vital signs and imaging results were used to develop models [23]. However, the variables used in our study were similar to most of the studies. Despite this, a comparison of our study with previous studies showed that the performance of our selected model was better than those models (Table 8). e model developed by Gao et al. [43] has better performance (AUC � 0.976 vs. AUC � 0.972); however, this model was developed with small sample size. In addition, the F-score (F � 0.97) of the model developed by Yan et al. [19] was higher than our selected model. However, Barish et al. [46] showed that Yan's model did not have a good result in the external validation. Khan's model [26] also has a higher F-score than our model. Khan et al. and Gao et al. used unbalanced data; Barish et al. [46] have shown that models developed based on unbalanced data to predict death from COVID-19 may not have accurate results in the real environment.
We found that machine learning models perform differently in subpopulations in terms of gender and age groups. Other studies similarly show that predictive models have different performances in different ethnic groups, genders, and age groups of patients and patients with different insurance [41,42].
erefore, researchers and clinicians should apply these models to different population groups cautiously. Moreover, developing models for different patient groups may be necessary.
e strengths of our model are the use of demographic data, symptoms, and comorbidities that can be easily collected. Despite some previous studies, we did not use laboratory, treatment, and imaging data. It can be considered a limitation. However, we supposed that all patients received almost similar treatments. Moreover, applying models which are developed based on treatment data may be difficult because of changes in patients' treatment. Furthermore, models that depend on laboratory and imaging data require a lot of time and cost to gather these data to use the model in a real clinical environment. A comparison of our study with those that used laboratory and imaging data (Table 8) indicates that our selected model outperforms many of these models. A study also indicated that imaging data did not affect the performance of machine learning models to predict death from COVID-19 [23]. In addition, the data used in our study have been collected from 38 hospitals, which is the strength of the study. A similar study indicated that up to 20% of missing data in COVID-19 studies is acceptable for developing machine learning models [18]; however, the missing rate in our study was under 4%.
Despite the strengths, some limitations should be considered. Firstly, we only analyzed the subpopulation bias based on gender and age groups. Future studies should consider other variables in this analysis. Furthermore, there are several well-established models such as APACHE and SOFA [41,42]. Researchers are recommended to compare the performance of machine learning models with these models to predict deaths from COVID-19.

Conclusions
Different machine learning models were developed to predict the likelihood of death caused by COVID-19. e best prediction model was the C5 decision tree (accuracy � 91.18%, AUC � 0.96, and F � 0.916). erefore, this model can be used to detect high-risk patients and improve the use of facilities, equipment, and medical practitioners for patients with COVID-19.

Data Availability
e data used to support the findings of this study are restricted by the Ethics Research Committee of Ahvaz Jundishapur University of Medical Sciences in order to protect patient privacy.

Conflicts of Interest
e authors declare that there are no conflicts of interest.